Dupes (advanced)
Dupes, or duplicates, are values that match other existing values in a column. You can configure dupe detection to find exact or fuzzy matches, and both modes support case-insensitive matching.
Exact matches detect entries that may match despite formatting issues. For example, Collibra DQ flags John Doe, JOHN DOE, and john doe as exact match dupes of each other. Exact match groups the same dupes and displays their counts in the occurs column on the Findings page.
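The grouping and counting behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Collibra DQ's implementation: the function name `exact_dupe_counts` and the normalization it applies (trimming whitespace and lowercasing) are assumptions made for this example.

```python
from collections import Counter


def exact_dupe_counts(values):
    """Return {normalized value: occurrence count} for values seen more than once.

    Normalization (strip + casefold) is an assumption for illustration only.
    """
    keys = [v.strip().casefold() for v in values]
    counts = Counter(keys)
    # Keep only groups with at least two members, i.e. actual duplicates.
    return {key: n for key, n in counts.items() if n > 1}


names = ["John Doe", "JOHN DOE", "john doe", "Jane Roe"]
result = exact_dupe_counts(names)
print(result)
```

Each group is reported once with its count, which mirrors the role of the occurs column.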
Fuzzy matches detect entries that may match despite misspellings, common spelling variations, and more. For example, Collibra DQ flags John Doe, Johnny Doe, and Johhnn Doe as fuzzy match dupes of each other. The fuzzy match algorithm has O(n²) complexity because it compares every pair of values in the dataset.
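To make the pairwise comparison concrete, here is a minimal sketch that scores every pair of values with Python's standard-library `difflib.SequenceMatcher`. The similarity measure, the 0.8 threshold, and the function name `fuzzy_pairs` are assumptions chosen for illustration; Collibra DQ's actual scoring algorithm is not described here and may differ.

```python
from difflib import SequenceMatcher
from itertools import combinations


def fuzzy_pairs(values, threshold=0.8, ignore_case=True):
    """Compare every pair of values and keep those scoring at or above threshold."""
    matches = []
    # combinations(values, 2) yields n * (n - 1) / 2 pairs, hence quadratic cost.
    for a, b in combinations(values, 2):
        left, right = (a.casefold(), b.casefold()) if ignore_case else (a, b)
        score = SequenceMatcher(None, left, right).ratio()
        if score >= threshold:
            matches.append((a, b, round(score, 2)))
    return matches


names = ["John Doe", "Johnny Doe", "Johhnn Doe", "Carrisa Rimmer"]
pairs = fuzzy_pairs(names)
for a, b, score in pairs:
    print(f"{a!r} ~ {b!r}: {score}")
```

With these inputs, the three John Doe variants pair up with each other, while the unrelated name does not reach the threshold.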
When fuzzy match runs on a dataset that is expected to have only exact matches, the number of dupe findings can be greater than the number of rows in the dataset, because each value is compared with every other value in the list.
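One way to see why pairwise findings can outnumber rows is to count the pairs directly. The sketch below is an assumption-laden illustration (the helper `pair_findings` and the sample values are hypothetical), not a description of how Collibra DQ tallies results.

```python
from itertools import combinations


def pair_findings(values):
    """Return every pair of positions whose values are identical."""
    return [(i, j) for (i, a), (j, b) in combinations(enumerate(values), 2) if a == b]


rows = ["ACME Corp"] * 5
findings = pair_findings(rows)
# 5 identical rows produce 5 * 4 / 2 = 10 pairs, twice the row count.
print(len(rows), len(findings))
```

An exact-match grouping of the same data would report a single group with a count of 5, which is why exact match is the better fit for data that only contains exact duplicates.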
General Ledger. Accounting use-case
Whether you're looking for a fuzzy matching percent or single client cleanup, Collibra DQ's duplicate detection can help you sort and rank the likelihood of duplicate data.
-f "file:///home/ec2-user/single_customer.csv" \
-d "," \
-ds customers \
-rd 2018-01-08 \
-dupe \
-dupenocase \
-depth 4
User Table has duplicate user entry
Carrisa Rimmer vs Carrissa Rimer
ATM customer data with only an 88% match
In most cases, a match below 90% is a false positive. Each dataset is a bit different, but in many cases you should tune your duplicate threshold to roughly 90% or higher for interesting findings.
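The effect of tuning the threshold can be sketched as a simple filter-and-rank step over scored pairs. Everything in this example is hypothetical: the function `rank_findings`, the record identifiers, and the scores are made up for illustration and do not come from Collibra DQ output.

```python
def rank_findings(scored_pairs, min_score=90):
    """Drop pairs below min_score and sort the rest from most to least similar."""
    kept = [pair for pair in scored_pairs if pair[2] >= min_score]
    return sorted(kept, key=lambda pair: pair[2], reverse=True)


# Hypothetical (left record, right record, match percent) tuples.
scored = [
    ("cust-001", "cust-017", 88),
    ("cust-004", "cust-009", 97),
    ("cust-010", "cust-011", 91),
]
ranked = rank_findings(scored)
print(ranked)
```

Raising `min_score` trades recall for precision: fewer findings to review, with a higher share of true duplicates.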
Simple DataFrame Example
Known Limitations
You cannot run a dupes check on all columns, or on more than 10 columns, at once.